Holo-Entropy Based Categorical Data Hierarchical Clustering
نویسندگان
چکیده
Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing similarity between data objects, or make use of computationally demanding techniques such as PCA for numerical data. Hierarchical clustering algorithms are often based on similarity measures computed on a common feature space, which is not effective when clustering highdimensional data. Subspace clustering algorithms discover feature subspaces for clusters, but are mostly partition-based; i.e. they do not produce a hierarchical structure of clusters. In this paper, we propose a hierarchical algorithm for clustering high-dimensional categorical data, based on a recently proposed information-theoretical concept named holo-entropy. The algorithm proposes new ways of exploring entropy, holo-entropy and attribute weighting in order to determine the feature subspace of a cluster and to merge clusters even though their feature subspaces differ. The algorithm is tested on UCI datasets, and compared with several state-of-the-art algorithms. Experimental results show that the proposed algorithm yields higher efficiency and accuracy than the competing algorithms and allows higher reproducibility.
منابع مشابه
The "Best K" for Entropy-based Categorical Data Clustering
With the growing demand on cluster analysis for categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed the important problem for categorical clustering – how can we determine the best K number of clusters for a categorical dataset? Since categorical data does not have the inherent distance function ...
متن کاملTowards Finding Optimal Partitions of Categorical Datasets
A considerable amount of work has been dedicated to clustering numerical data sets, but only a handful of categorical clustering algorithms are reported to date. Furthermore, almost none has addressed the following two important cluster validity problems: (1) Given a data set and a clustering algorithm that partitions the data set into k clusters, how can we determine the best k with respect to...
متن کاملارائه یک الگوریتم خوشه بندی برای داده های دسته ای با ترکیب معیارها
Clustering is one of the main techniques in data mining. Clustering is a process that classifies data set into groups. In clustering, the data in a cluster are the closest to each other and the data in two different clusters have the most difference. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...
متن کاملDetecting the Change of Clustering Structure in Categorical Data Streams
Analyzing clustering structures in data streams can provide critical information for making decision in realtime. Most research has been focused on clustering algorithms for data streams. We argue that, more importantly, we need to monitor the change of clustering structure online. In this paper, we present a framework for detecting the change of critical clustering structure in categorical dat...
متن کاملClustering Numerical and Categorical Data
Clustering is an important technique for data mining which allows us to discover unknown relationships in our data sets. Clustering algorithms that use metrics based on the natural ordering of numbers cannot be applied to categorical (non-numerical) data. In this tutorial we will review the main methods for numerical data clustering (K-Means, Hierarchical Clustering and Fuzzy CMeans) and then s...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Informatica, Lith. Acad. Sci.
دوره 28 شماره
صفحات -
تاریخ انتشار 2017